The purpose of this notebook is twofold. First, it demonstrates the basic functionality of PyLogit for estimating nested logit models. Second, it compares the nested logit capabilities of PyLogit with those of Python Biogeme. The dataset used is the SwissMetro dataset from http://biogeme.epfl.ch/examples_swissmetro.html. For an explanation of the variables in the dataset, see http://www.strc.ch/conferences/2001/bierlaire1.pdf
In [1]:
from collections import OrderedDict # For recording the model specification
import pandas as pd # For file input/output
import numpy as np # For vectorized math operations
import statsmodels.tools.numdiff as numdiff # For numeric hessian
import scipy.linalg # For matrix inversion
import pylogit as pl # For choice model estimation
from pylogit import nested_logit as nl # For nested logit convenience funcs
In [2]:
# Load the raw swiss metro data
# Note the .dat files are tab delimited text files
swissmetro_wide = pd.read_table("../data/swissmetro.dat", sep='\t')
Note that the 09NestedLogit.py file provided is an example from Python Biogeme (see: http://biogeme.epfl.ch/examples_swissmetro.html). The 09NestedLogit.py file excludes observations meeting the following criteria:

exclude = (( PURPOSE != 1 ) * ( PURPOSE != 3 ) + ( CHOICE == 0 )) > 0

As a result, their dataset has 6,768 observations. Below, I make the same exclusions.
In [3]:
# Select observations whose choice is known (i.e. CHOICE != 0)
# **AND** whose PURPOSE is either 1 or 3
include_criteria = (swissmetro_wide.PURPOSE.isin([1, 3]) &
                    (swissmetro_wide.CHOICE != 0))
# Use ".copy()" so that later on, we avoid performing operations
# on a view of a dataframe as opposed to on an actual dataframe
clean_sm_wide = swissmetro_wide.loc[include_criteria].copy()
# Look at how many observations we have after removing unwanted
# observations
final_num_obs = clean_sm_wide.shape[0]
num_obs_statement = "The cleaned number of observations is {:,.0f}."
print(num_obs_statement.format(final_num_obs))
In [4]:
# Create a custom id column that ignores the fact that this is a
# panel/repeated-observations dataset, and start the "custom_id" from 1
clean_sm_wide["custom_id"] = np.arange(clean_sm_wide.shape[0], dtype=int) + 1
In [5]:
# Look at the columns of the swissmetro data
clean_sm_wide.columns
Out[5]:
In [6]:
# Create the list of individual specific variables
ind_variables = clean_sm_wide.columns.tolist()[:15]
# Specify the variables that vary across individuals **AND**
# across some or all alternatives
alt_varying_variables = {u'travel_time': dict([(1, 'TRAIN_TT'),
                                               (2, 'SM_TT'),
                                               (3, 'CAR_TT')]),
                         u'travel_cost': dict([(1, 'TRAIN_CO'),
                                               (2, 'SM_CO'),
                                               (3, 'CAR_CO')]),
                         u'headway': dict([(1, 'TRAIN_HE'),
                                           (2, 'SM_HE')]),
                         u'seat_configuration': dict([(2, "SM_SEATS")])}
# Specify the availability variables
availability_variables = dict(zip(range(1, 4), ['TRAIN_AV', 'SM_AV', 'CAR_AV']))
# Determine the columns that will denote the
# new column of alternative ids, and the columns
# that denote the custom observation ids and the
# choice column
new_alt_id = "mode_id"
obs_id_column = "custom_id"
choice_column = "CHOICE"
In [7]:
# Perform the desired conversion
long_swiss_metro = pl.convert_wide_to_long(clean_sm_wide,
                                           ind_variables,
                                           alt_varying_variables,
                                           availability_variables,
                                           obs_id_column,
                                           choice_column,
                                           new_alt_id_name=new_alt_id)
# Look at the first 9 rows of the long-format dataframe
long_swiss_metro.head(9).T
Out[7]:
In [8]:
# Scale both the travel time and travel cost by dividing by 100
long_swiss_metro["travel_time_hundredth"] = (long_swiss_metro["travel_time"] /
                                             100.0)
# Figure out which rows correspond to train or swiss metro
# alternatives for individuals with GA passes. These individuals face no
# marginal costs for a trip
train_pass_train_alt = ((long_swiss_metro["GA"] == 1) *
                        (long_swiss_metro["mode_id"].isin([1, 2]))).astype(int)
# Note that the (train_pass_train_alt == 0) term accounts for the
# fact that those with a GA pass have no marginal cost for the trip
long_swiss_metro["travel_cost_hundredth"] = (long_swiss_metro["travel_cost"] *
                                             (train_pass_train_alt == 0) /
                                             100.0)
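As a quick sanity check on the GA-pass logic just applied, the scaled cost should now be zero for every train or Swissmetro row belonging to a GA pass holder:

# Sanity check: GA pass holders face zero scaled cost on the
# train (id 1) and Swissmetro (id 2) alternatives.
ga_rail_rows = ((long_swiss_metro["GA"] == 1) &
                long_swiss_metro["mode_id"].isin([1, 2]))
assert (long_swiss_metro.loc[ga_rail_rows, "travel_cost_hundredth"] == 0).all()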
In [9]:
# Specify the nesting values
nest_membership = OrderedDict()
nest_membership["Future Modes"] = [2]
nest_membership["Existing Modes"] = [1, 3]
In [10]:
# Create the model's specification dictionary and variable names dictionary
# NOTE: - Keys should be variables within the long format dataframe.
#         The sole exception to this is the "intercept" key.
#       - For the specification dictionary, the values should be lists
#         or lists of lists. Within a list, or within the inner-most
#         list, should be the alternative IDs of the alternatives whose
#         utility specifications the explanatory variable enters.
example_specification = OrderedDict()
example_names = OrderedDict()
# Note that 1 is the id for the Train and 3 is the id for the Car.
# The next two lines are placing alternative specific constants in
# the utility equations for the Train and for the Car. The order
# in which these variables are placed is chosen so the summary
# dataframe which is returned will match that shown in the HTML
# file of the python biogeme example.
example_specification["intercept"] = [3, 1]
example_names["intercept"] = ['ASC Car', 'ASC Train']
# Note that the names used below are simply for consistency with
# the coefficient names given in the Python Biogeme example.
example_specification["travel_cost_hundredth"] = [[1, 2, 3]]
example_names["travel_cost_hundredth"] = ['B_COST']
example_specification["travel_time_hundredth"] = [[1, 2, 3]]
example_names["travel_time_hundredth"] = ['B_TIME']
One main difference between the nested logit implementation in PyLogit and those in Python Biogeme or mlogit in R is that PyLogit reparameterizes the 'standard' nested logit model. In particular, one standard representation of the nested logit model is in terms of the inverse of the 'scale' parameter for each nest (see, for example, the representation given by Kenneth Train in section 4.2 here). The 'scale' parameter has a domain of zero to infinity, so the inverse of the scale parameter has the same domain.
However, for econometric purposes (such as conforming to the assumption that individuals make choices through a utility-maximizing decision protocol), the scale parameter of a 'lower level nest' is constrained to be greater than or equal to 1 (assuming that the scale parameter of the 'upper level nest' is constrained to 1.0 for identification purposes). The inverse of the scale parameter is then constrained to lie between 0.0 and 1.0. In order to make use of unconstrained optimization algorithms, we therefore estimate the logit (i.e. $\ln \left[ \frac{\textrm{scale}^{-1}}{1.0 - \textrm{scale}^{-1}} \right]$) of the inverse of the scale parameter, under the assumption that the inverse of the scale parameter lies between zero and one (and accordingly that the scale parameter is greater than or equal to one).
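Concretely, if $x$ denotes the unconstrained parameter that is actually estimated, inverting this transform gives $\textrm{scale}^{-1} = \frac{1}{1 + e^{-x}}$, so that $\textrm{scale} = 1 + e^{-x} > 1$ for every finite $x$. This is the relationship used later in this notebook to recover Mu from the estimated nest parameter.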
In [11]:
# Define a function that calculates the "logit" transformation of values
# between 0.0 and 1.0.
def logit(x):
    """
    Parameters
    ----------
    x : int, float, or 1D ndarray.
        If an array, all elements should be ints or floats. All
        elements should be strictly between zero and one.

    Returns
    -------
    The logit of x: `np.log(x / (1.0 - x))`.
    """
    return np.log(x / (1.0 - x))
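As a quick round-trip check of this transform (using the 2.05 scale value that appears later in this notebook):

# Illustrative round trip: 1/2.05 is the inverse scale, its logit is
# the unconstrained value, and 1 + exp(-x) recovers the scale itself.
x = logit(1.0 / 2.05)
print(x)                 # approximately -0.0488
print(1.0 + np.exp(-x))  # recovers approximately 2.05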
In [12]:
# Provide the module with the needed input arguments to create
# an instance of the Nested Logit model class
example_nested = pl.create_choice_model(data=long_swiss_metro,
                                        alt_id_col=new_alt_id,
                                        obs_id_col=obs_id_column,
                                        choice_col=choice_column,
                                        specification=example_specification,
                                        model_type="Nested Logit",
                                        names=example_names,
                                        nest_spec=nest_membership)
# Specify the initial nesting parameter values
# Note: These should be in terms of the reparameterized values used
# by PyLogit.
# Note: The '40' corresponds to a scale parameter that is numerically
# indistinguishable from 1.0.
# Note: 2.05 is the scale parameter that is estimated by Python Biogeme,
# so we invert it, then take the logit of this inverse to get the
# corresponding starting value to be used by PyLogit.
# Note the first value corresponds to the first nest in 'nest_spec'
# and the second value corresponds to the second nest in 'nest_spec'.
init_nests = np.array([40, logit(2.05**-1)])
# Specify the initial index coefficients used by Python Biogeme
init_coefs = np.array([-0.167, -0.512, -0.899, -0.857])
# Create a single array of the initial values
init_values = np.concatenate((init_nests, init_coefs), axis=0)
# Start the model estimation from the Python Biogeme initial values.
# Note that the first value, in the initial values, is constrained
# to remain constant through the estimation process. This is because
# the first nest in nest_spec is a 'degenerate' nest with only one
# alternative, and the nest parameter of degenerate nests is not
# identified.
example_nested.fit_mle(init_values,
                       constrained_pos=[0])
Also, note that the functionality of using parameter constraints is currently restricted to the Mixed Logit and Nested Logit models. Moreover, this functionality is only relevant when using optimization methods that make use of gradient information. Gradient-free estimation methods such as Powell's method or Nelder-Mead will not make use of the constrained_pos keyword argument.
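For example, a call along the following lines (shown only as a sketch; it assumes fit_mle forwards its method argument to scipy.optimize.minimize, and it is not run here) would leave constrained_pos without effect:

# Hypothetical sketch: with a gradient-free optimizer such as Powell's
# method, the constrained_pos argument below would simply be ignored.
# example_nested.fit_mle(init_values,
#                        method="powell",
#                        constrained_pos=[0])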
In [13]:
# Look at the estimated coefficients and goodness-of-fit statistics
example_nested.get_statsmodels_summary()
Out[13]:
In [14]:
# Note that the Mu (i.e. the scale parameter) estimated by Python Biogeme is
# 1.0 / nest_coefficient where
# nest_coefficient = 1.0 / (1.0 + exp[-1 * estimated_nest_param])
pylogit_mu = 1.0 + np.exp(-1 * example_nested.params["Existing Modes Nest Param"])
print("PyLogit's estimated Mu is: {:,.4f}".format(pylogit_mu))
My parameter estimates match those of Python Biogeme.
The Python Biogeme log-likelihood is -5,236.900 and their estimated parameters are:
ASC Car: -0.167
ASC Train: -0.512
B_COST: -0.857
B_TIME: -0.899
Mu: 2.05
As shown above, my log-likelihood is -5,236.900, and my estimated parameters are:
ASC Car: -0.1672
ASC Train: -0.5119
B_COST: -0.8567
B_TIME: -0.8987
Existing Modes Nest Param (converted to Mu): 2.0541
PyLogit's covariance estimates for the Nested Logit model are currently based on the BHHH approximation to the Fisher Information Matrix. This is the same procedure used by mlogit. However, based on the disagreement between PyLogit's standard errors and those of Python Biogeme, Python Biogeme is clearly not using the BHHH approximation to the Fisher Information Matrix to calculate its standard errors. How does Python Biogeme calculate its standard errors?
In [15]:
# Create objects for all of the necessary arguments that are
# needed to compute the log-likelihood of the nested logit model
# given the data used in this example
nested_design = example_nested.design
mapping_res = example_nested.get_mappings_for_fit()
choice_array = long_swiss_metro["CHOICE"].values
# Create a 'convenience' function that simply returns the log-likelihood
# given a vector of coefficients
def convenient_log_likelihood(all_coefs):
    log_likelihood = nl.convenient_nested_log_likelihood(all_coefs,
                                                         nested_design,
                                                         mapping_res["rows_to_obs"],
                                                         mapping_res["rows_to_nests"],
                                                         choice_array)
    return log_likelihood
# Calculate the numeric hessian
numeric_hess = numdiff.approx_hess(example_nested.params.values,
                                   convenient_log_likelihood)
# Account for the fact that the first param is constrained: zero out
# its row and column, and place a -1 on the diagonal so the matrix
# stays invertible. (The resulting 'variance' for this parameter is
# meaningless and is overwritten with np.nan below.)
numeric_hess[0, :] = 0
numeric_hess[:, 0] = 0
numeric_hess[0, 0] = -1
# Calculate the asymptotic covariance with the numeric hessian
numeric_cov = -1 * scipy.linalg.inv(numeric_hess)
# Get the numeric standard errors
numeric_std_errs = pd.Series(np.sqrt(np.diag(numeric_cov)),
                             index=example_nested.params.index)
# Make sure the Future Modes Nest param has a standard error of np.nan
numeric_std_errs.loc["Future Modes Nest Param"] = np.nan
# Order the numeric standard errors according to the Python Biogeme
# output
numeric_std_errs = pd.concat([numeric_std_errs[example_nested.params.index[2:]],
                              numeric_std_errs[example_nested.params.index[:2]]],
                             axis=0)
# Display the numeric standard errors
numeric_std_errs
Out[15]:
Python Biogeme Output
Name        Value    Std err   t-test   p-value
ASC_CAR     -0.167   0.0371     -4.50      0.00
ASC_TRAIN   -0.512   0.0452    -11.33      0.00
B_COST      -0.857   0.0463    -18.51      0.00
B_TIME      -0.899   0.0570    -15.77      0.00
MU           2.05    0.118      17.45      0.00
From above, we see that for the index coefficients, the standard errors that are calculated using the numeric approximation of the hessian match the standard errors returned by Python Biogeme. This suggests that the standard errors of Python Biogeme, for the nested logit model, are based on a numeric differentiation approximation to the Hessian.
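For a side-by-side view of the disagreement with PyLogit's BHHH-based results, something like the following works (a sketch assuming the fitted model exposes its BHHH-based standard errors as the standard_errors Series, which PyLogit populates after estimation):

# Align PyLogit's BHHH-based standard errors against the numeric ones;
# the DataFrame constructor aligns the two Series on their shared index.
pd.DataFrame({"BHHH": example_nested.standard_errors,
              "Numeric Hessian": numeric_std_errs})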
Below, we investigate whether the numeric approximation of the gradient via numeric differentiation is a close approximation to the analytic gradient. The premise is that if the numeric gradient does not adequately approximate the analytic gradient, then what chance does the numeric hessian have of adequately approximating the analytic hessian?
In [16]:
# Approximate the gradient using numeric differentiation
numeric_grad = numdiff.approx_fprime(example_nested.params.values,
                                     convenient_log_likelihood)
pd.DataFrame([numeric_grad,
              example_nested.gradient.values],
             index=["Numeric Differentiation", "Analytic"],
             columns=example_nested.params.index).T
Out[16]:
From the dataframe above, we see that the numeric gradient does not adequately approximate the analytic gradient. The numeric gradient is off by a factor of 2 to 4, depending on which variable is being examined (this excludes the Future Modes Nest Param, which is constrained to 40).
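To quantify that mismatch, one can look at the elementwise ratio of the two gradients (a short illustrative snippet using the objects computed above), skipping the constrained first entry:

# Ratio of numeric to analytic gradient entries, excluding the
# constrained Future Modes Nest Param in position 0.
grad_ratio = numeric_grad[1:] / example_nested.gradient.values[1:]
print(grad_ratio)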
Given that the numeric gradient does not provide a good approximation of the analytic gradient, I do not expect the numeric hessian (nor the BHHH approximation used by PyLogit) to provide a good approximation to the analytic hessian. Since the standard errors calculated from the numeric hessian match Python Biogeme's standard errors, this suggests that Python Biogeme is calculating its hessian (and therefore its standard errors) via numeric differentiation, and, as suggested above, this is likely to be a poor approximation to the analytic hessian.